Least-Squares Temporal Difference Learning
Author
Abstract
TD(λ) is a popular family of algorithms for approximate policy evaluation in large MDPs. TD(λ) works by incrementally updating the value function after each observed transition. It has two major drawbacks: it makes inefficient use of data, and it requires the user to manually tune a stepsize schedule for good performance. For the case of linear value function approximations and λ = 0, the Least-Squares TD (LSTD) algorithm of Bradtke and Barto (Bradtke and Barto, 1996) eliminates all stepsize parameters and improves data efficiency. This paper extends Bradtke and Barto's work in three significant ways. First, it presents a simpler derivation of the LSTD algorithm. Second, it generalizes from λ = 0 to arbitrary values of λ; at the extreme of λ = 1, the resulting algorithm is shown to be a practical formulation of supervised linear regression. Third, it presents a novel, intuitive interpretation of LSTD as a model-based reinforcement learning technique.
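The λ = 0 case described above can be sketched in a few lines: LSTD accumulates the statistics A = Σ φ(s)(φ(s) − γφ(s'))ᵀ and b = Σ φ(s)r from observed transitions and solves Aw = b for the linear value-function weights, with no stepsize to tune. This is a minimal illustration, not the paper's own pseudocode; the feature map and the toy chain are assumptions chosen for the example.

```python
import numpy as np

def lstd0(transitions, phi, gamma):
    """LSTD(0) sketch: solve A w = b built from (s, r, s') transitions.

    transitions: list of (state, reward, next_state) tuples (illustrative format).
    phi: feature map from state to a NumPy vector.
    """
    k = len(phi(transitions[0][0]))
    A = np.zeros((k, k))
    b = np.zeros(k)
    for s, r, s_next in transitions:
        f, f_next = phi(s), phi(s_next)
        A += np.outer(f, f - gamma * f_next)   # accumulate A
        b += r * f                             # accumulate b
    return np.linalg.solve(A, b)               # weights of the linear value fn

# Tiny two-state chain (an assumed example): state 0 -> state 1 with reward 1,
# state 1 -> state 1 with reward 0. With gamma = 0.5, V(0) = 1 and V(1) = 0.
phi = lambda s: np.eye(2)[s]                   # one-hot (tabular) features
data = [(0, 1.0, 1), (1, 0.0, 1)] * 50
w = lstd0(data, phi, gamma=0.5)
print(w)                                       # -> [1. 0.]
```

With one-hot features the solved weights are exactly the state values, which makes the toy chain easy to check by hand.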
Similar References
An Analysis of Temporal Difference Learning with Function Approximation
We discuss the temporal difference learning algorithm as applied to approximating the cost-to-go function of an infinite-horizon discounted Markov chain. The algorithm we analyze updates parameters of a linear function approximator on-line during a single endless trajectory of an irreducible aperiodic Markov chain with a finite or infinite state space. We present a proof of convergence with probabili...
An Analysis of Temporal-Difference Learning with Function Approximation
We discuss the temporal-difference learning algorithm, as applied to approximating the cost-to-go function of an infinite-horizon discounted Markov chain. The algorithm we analyze updates parameters of a linear function approximator on-line, during a single endless trajectory of an irreducible aperiodic Markov chain with a finite or infinite state space. We present a proof of convergence (with proba...
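The on-line update analyzed in this abstract can be sketched as follows. This is a hedged illustration of the standard TD(λ) rule with linear features, not code from the paper; the step size, trace decay, feature map, and toy chain are all assumptions made for the example.

```python
import numpy as np

def td_lambda_step(w, z, s, r, s_next, phi, alpha=0.1, gamma=0.9, lam=0.8):
    """One on-line TD(lambda) update with linear features (illustrative names)."""
    delta = r + gamma * phi(s_next) @ w - phi(s) @ w  # TD error
    z = gamma * lam * z + phi(s)                      # eligibility trace
    return w + alpha * delta * z, z

# Repeatedly sweep the same assumed two-state chain: 0 -> 1 (reward 1),
# 1 -> 1 (reward 0). With gamma = 0.9, the values are V(0) = 1, V(1) = 0.
phi = lambda s: np.eye(2)[s]                          # one-hot features
w, z = np.zeros(2), np.zeros(2)
for _ in range(2000):
    for s, r, s_next in [(0, 1.0, 1), (1, 0.0, 1)]:
        w, z = td_lambda_step(w, z, s, r, s_next, phi)
print(w)                                              # close to [1. 0.]
```

The incremental style above is exactly what makes plain TD(λ) sensitive to the stepsize schedule (here the assumed `alpha`), which is the drawback LSTD removes.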
Analytical Mean Squared Error Curves in Temporal Difference Learning
We have calculated analytical expressions for how the bias and variance of the estimators provided by various temporal difference value estimation algorithms change with offline updates over trials in absorbing Markov chains using lookup table representations. We illustrate classes of learning curve behavior in various chains, and show the manner in which TD is sensitive to the choice of its steps...
The Mean and the Variance Matrix of the 'Fixed' GPS Baseline
In this contribution we determine the first two moments of the 'fixed' GPS baseline. The first two moments of the 'float' solution are well-known. They follow from standard adjustment theory. In order to determine the corresponding moments of the 'fixed' solution, the probabilistic characteristics of the integer least-squares ambiguities need to be taken into account. It is shown that the 'fixed' GPS b...
PII: S0165-1684(01)00098-6
A popular technique for time delay estimation is to use an FIR filter to model the time difference, where the filter weights are interpolated with a sinc function to obtain the delay estimate. However, the sinc interpolator requires a sufficiently long filter length for accurate delay estimation. In this paper, we propose to process the filter weights via a least-squares-based method in order to acquire ...